To analyze relationships between CPI (consumer price index) and unemployment rates across time, the variables we will keep are: the CPI variable (`cpi`), the unemployment rate variable (`unemployment_rates`), and one variable each for year and month (`year` and `month`). After using the annual average column (`annual`) to fill in the missing monthly unemployment data for September 1953, as noted below, I discarded that variable. Similarly, I discarded the `half1_half2` variable. However, if I wanted to analyze annual averages over time, I would have kept both of those variables and wrangled the `half1_half2` variable to provide an annual CPI average to contrast with the yearly unemployment rate average.
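The column selection described above might be sketched as follows; the data frame name `cpi_unemp` is a hypothetical stand-in for the merged data:

```r
# A minimal sketch of keeping only the four variables named above; assumes
# the merged data frame is called cpi_unemp (a hypothetical name).
library(dplyr)

cpi_unemp <- select(cpi_unemp, year, month, cpi, unemployment_rates)
```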
I found several peculiarities through multiple sanity checks. In table 1, the row for the year 1947 was not located sequentially (before 1948, in ascending order like the rest of the years); instead it sat between the rows for 1967 and 1968, so it was not immediately clear from a quick visual scan that the data began in 1947. Table 1 was also missing data for the year 1984.
In table 2, there is a negative value for April 1977, and the September 1953 data was missing. The negative value in 1977 is peculiar, as a negative unemployment rate is nonsensical. Both problems could be handled the same way: since we have the data from the other 11 months in each of those years, along with the annual averages, it was possible to calculate what the missing values were.
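The back-calculation from an annual average can be sketched as follows; the monthly values below are hypothetical stand-ins, not the actual figures from table2.xlsx:

```r
# If the annual figure is the mean of the 12 months, the missing month is
# 12 * annual - sum(other 11 months). The values here are hypothetical.
annual_avg    <- 4.0
other_months  <- c(3.8, 3.9, 4.0, 4.1, 4.0, 3.9, 4.1, 4.2, 4.0, 3.9, 4.1)
missing_month <- 12 * annual_avg - sum(other_months)   # approximately 4.0
```

The same identity recovers the April 1977 value once the spurious negative entry is excluded from the sum.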
Additionally, all the values for 1961 are 0.0. Based on the rest of the data, it seems unlikely that these are the true values for 1961. It might be reasonable to treat this as missing data, but I decided to keep it for now, prior to doing any analysis. Depending on the analysis I decide to do later, I might treat this year as missing data and drop it.
Another peculiarity I noticed is that in table2.xlsx, the data for each month is given to one decimal place, but the data under the “Annual” column is given as integers. Assuming that the “Annual” column in table 2 indicates the annual average unemployment rate for that year, it would be important to adjust for the rounding that occurs. For example, for the year 1948 in table2.xlsx, the average unemployment rate as calculated directly from the 12 months of data is 3.8, but the `Annual` variable is given as 4. Fascinatingly, once the .xlsx file was read into the .rmd file with the `read_xlsx()` function, the values in the `Annual` column became doubles, given to one decimal place, rather than integers as in the .xlsx file. This suggests the spreadsheet stores the full values and merely displays them rounded to integers.
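One way to flag such rounding discrepancies is to recompute the annual mean from the monthly columns and compare it with the stored column; the column names `Jan:Dec` and `Annual` are assumptions about the sheet's layout:

```r
# Compare the stored Annual column with the mean of the 12 monthly columns.
# Column names (Jan:Dec, Annual) are hypothetical here.
library(readxl)
library(dplyr)

table2 <- read_xlsx("table2.xlsx")
monthly_mean <- rowMeans(select(table2, Jan:Dec))
table2$Annual - monthly_mean   # nonzero entries reveal rounding
```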
To handle NA values: since there was no data for 1947 in table 2 and no data for 1984 in table 1, I deleted the 1947 row from table 1 and the 1984 row from table 2, so that the final tidy dataframe would not contain years with data only about CPI and nothing about unemployment rates, or vice versa. For any comparison of CPI and unemployment rates, those years would be useless.
In Fig. 1.2, we learn about the distribution of singer heights within each of the eight categories by histograms that show the percentage of singers within a category that are of a certain height. For example, we can see that approximately 20% of the first bass singers are 70 inches tall, and if we want to know the percentage of first sopranos that are 65 inches tall, we can find out that the proportion is approximately 40%. Thus, we can visualize roughly what the most “popular” heights are within each category of singer, and compare the distributions to each other.
In contrast, Fig. 2.1 displays similar information, but in the form of quantile plots rather than histograms. From these plots we can identify which height corresponds to any quantile we might be interested in. For example, for the second basses, the 0.5 quantile (the median) lies at around 72 inches, which means that half of the second basses are less than or equal to 72 inches tall. We can also find, for each specific height within a category, the count of singers with that height.
In Fig. 1.2, it is easier to immediately visualize what the distribution of heights within a category looks like: the general shape of the histogram and where along the spectrum of height it lands in terms of minimum and maximum values. For example, the first tenors are of relatively similar heights, with similar proportions across different heights, but there is much greater inequality of proportions among the first sopranos, where approximately 40% are 65 inches tall while several heights are represented by only 1-2% of the individuals. But in Fig. 1.2, without knowing the total number of singers in each group (which is not given), it is impossible to tell how many singers have each height in terms of counts. It is also difficult, if not impossible, to tell where landmark quantiles such as 0.25, 0.5, and 0.75 lie.
In contrast, we can see where the quantiles lie very easily in Fig. 2.1, not only the three major ones but a very good estimate for any value between 0 and 1. Additionally, by counting the points that align with each height, we can tell how many singers have each height, as well as the total number of singers in each group. Though in this case the eight groups seem to have relatively similar numbers of members, in some cases there might be significant discrepancies in group sizes, which we would be able to see from the quantile plots but not at all from Fig. 1.2, which conveys only percentages within each group.
Extrapolation refers to estimating the value of a variable that we do not have actual observed data about based on the relationships between variables that we do have data about, where these new estimated values fall outside of the range of the observed data. Interpolation is a similar concept, but refers to the estimation of unknown values that lie within the range of the observed data. Since it involves estimation of values beyond the range of observed data (both higher and lower than our range), extrapolation involves more dangerous assumptions than interpolation and is more likely to lead to incorrect or nonsensical estimations.
Interpolated and extrapolated quantile values should be used when we need a quantile whose f-value does not happen to correspond to any of the values of \(f_i\) we have in our observed data.
In the singers data, we have \(f_1 = 0.024\), which means we have an observed data point that gives us exactly the 0.024 quantile; but since \(f_2 = 0.07\), if we wanted to know the value of, for example, the 0.03 quantile, we would have to interpolate.
Similarly, if we wanted the value of the 0.01 quantile, we would need to extrapolate, as that f-value falls below the range of our data.
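Interpolation between the two nearest \(f_i\) values can be sketched with `approx()`; the heights below are hypothetical stand-ins for one sorted group of 21 singer heights:

```r
# f_i = (i - 0.5) / n, so with n = 21, f_1 = 0.024 and f_2 = 0.071.
heights <- sort(c(60, 61, 61, 62, 63, 63, 64, 64, 65, 65, 65,
                  66, 66, 66, 67, 67, 68, 68, 69, 70, 72))   # hypothetical
n <- length(heights)
f <- (seq_len(n) - 0.5) / n
approx(f, heights, xout = 0.03)$y   # linearly interpolated 0.03 quantile
```

Requesting an `xout` outside the range of `f` would require extrapolation, which `approx()` declines by default (returning `NA`), reflecting the greater risk involved.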
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| dataset | x | y |
|---|---|---|
| 1 | 10 | 8.04 |
| 1 | 8 | 6.95 |
| 1 | 13 | 7.58 |
| 1 | 9 | 8.81 |
| 1 | 11 | 8.33 |
| 1 | 14 | 9.96 |
To tidy the `anscombe` dataset, I first used the `pivot_longer()` function to pivot all columns so that there was a column called `variable` containing the `x1`, …, `y4` column names as character strings, and another column called `value` holding the value associated with each. I arranged the rows in ascending order of variable name so that all of the `x1` values were listed first, then the `x2` values, and so on. Next I created a new column called `y_value` that duplicated all of the values in the `value` column; this column would eventually end up storing the y values. I then used the `shift.column()` function from the `useful` package to shift all the values associated with the y rows 44 rows up, aligning each y value with the row of the x value it should be paired with. This meant that for each (x, y) observation, the associated x and y values were stored in the same row, each in their own column. Then I renamed some variables for clarity. Finally, I created a new variable `dataset` that recorded whether a pair was part of dataset 1, 2, 3, or 4, based on whether it had come from the `x1`/`y1`, `x2`/`y2`, `x3`/`y3`, or `x4`/`y4` column pair, respectively.
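For comparison, the same tidy result can be reached without the shifting step by letting `pivot_longer()` split each column name into a value type and a dataset number; this is an alternative sketch, not the approach described above:

```r
library(tidyr)

# names_pattern splits e.g. "x3" into the value type ("x", kept as its own
# column via the .value sentinel) and the dataset number ("3", stored in
# the dataset column), yielding columns dataset, x, and y directly.
tidy_anscombe <- pivot_longer(
  anscombe,
  cols          = everything(),
  names_to      = c(".value", "dataset"),
  names_pattern = "(x|y)(\\d)"
)
```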